Numerous new cache architectures, such as phased, pseudo-set-associative,
way-predicting, reactive-associative, way-shutdown, way-concatenating, and highly-associative,
are intended to reduce power and/or energy, but they all impose some performance 
overhead. We have developed a new cache architecture, called a way-halting cache, that 
Caches contribute to much of a microprocessor system's power and energy consumption.
We have developed a new cache architecture, called a way-halting cache, that reduces 
energy while imposing no performance overhead. Our way-halting cache is a four-way set- 
associative cache that stores the four lowest-order bits of all ways' tags into a fully 
associative memory, which we call the halt tag array. The lookup in the halt tag array is 
done in parallel with, and is no slower than, the set-index decod ... 
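
The halt-tag filtering described in this abstract can be illustrated with a minimal software sketch. It assumes a four-way set whose full tags are plain integers; the function names (`halt_tag`, `ways_to_access`, `lookup`) are hypothetical and are not taken from the paper, and the sketch models only which ways would be accessed, not timing or energy.

```python
HALT_BITS = 4  # the abstract stores the four lowest-order tag bits of each way


def halt_tag(tag, bits=HALT_BITS):
    """Return the low-order `bits` bits of a full tag."""
    return tag & ((1 << bits) - 1)


def ways_to_access(set_tags, lookup_tag, bits=HALT_BITS):
    """Return the ways whose halt tag matches the lookup tag.

    Ways whose halt tag differs cannot hold the requested line, so their
    access can be halted. `None` marks an invalid (empty) way.
    """
    wanted = halt_tag(lookup_tag, bits)
    return [way for way, tag in enumerate(set_tags)
            if tag is not None and halt_tag(tag, bits) == wanted]


def lookup(set_tags, lookup_tag):
    """Return (hit_way or None, list of ways that were actually accessed)."""
    active = ways_to_access(set_tags, lookup_tag)
    hit_way = next((w for w in active if set_tags[w] == lookup_tag), None)
    return hit_way, active
```

For a set holding tags `[0x1A3, 0x2B3, 0x3C7, None]`, a lookup of `0x2B3` accesses only ways 0 and 1, because way 2's halt tag differs and way 3 is empty.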
Scratch-pad memories (SPMs) enable fast access to time-critical data. While prior research
studied both static and dynamic SPM management strategies, not being able to keep all hot 
data (i.e., data with high reuse) in the SPM remains the biggest problem. This paper 
proposes data compression to increase the number of data blocks that can be kept in the 
SPM. Our experiments with several embedded applications show that our compression- 
based SPM management heuristic is very effective and outperforms ... 
Pointer-chasing applications tend to traverse composite data structures consisting of
multiple independent pointer chains. While the traversal of any single pointer chain leads to 
the serialization of memory operations, the traversal of independent pointer chains provides 
a source of memory parallelism. This article investigates exploiting such interchain memory 
parallelism for the purpose of memory latency tolerance, using a technique called multi- 
chain prefetching. Previous work ... 
Tile-based data layout has been applied to achieve various objectives such as minimizing
cache conflicts and memory row switching activity. In some applications of tile-based 
mapping, the size of the tile can be assumed to be a power of two. In this paper, this 'power
of two' assumption has been used to drastically simplify the tile-based address mapping
functions. Once optimized, the implementation of the non-linear tile-based mapping 
consumes 60% less power than the implementation of the linear ... 
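
To illustrate why the power-of-two assumption simplifies a tile-based mapping, here is a minimal sketch. It assumes a row-major order of tiles and row-major order within each tile, and additionally assumes the array width is itself a power of two; these layout choices and the function names are illustrative assumptions, not the mapping functions from the paper.

```python
def tiled_address_divmod(row, col, width, tile_w, tile_h):
    """General tiled mapping using division and modulo."""
    tiles_per_row = width // tile_w
    tile_index = (row // tile_h) * tiles_per_row + (col // tile_w)
    offset = (row % tile_h) * tile_w + (col % tile_w)
    return tile_index * (tile_w * tile_h) + offset


def tiled_address_pow2(row, col, width_log2, tw_log2, th_log2):
    """Same mapping when width, tile width and tile height are powers of two.

    Every division becomes a right shift, every modulo a bit mask, and the
    multiply-adds become shifts and ORs over disjoint bit fields.
    """
    tile_row = row >> th_log2
    tile_col = col >> tw_log2
    tile_index = (tile_row << (width_log2 - tw_log2)) | tile_col
    in_row = row & ((1 << th_log2) - 1)
    in_col = col & ((1 << tw_log2) - 1)
    offset = (in_row << tw_log2) | in_col
    return (tile_index << (tw_log2 + th_log2)) | offset
```

With a 32-element-wide array and 4x8 tiles, both functions return the same address for every `(row, col)`, but the second needs no divider or multiplier.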
Instruction caches have traditionally been used to improve software performance. Recently,
several tiny instruction cache designs, including filter caches and dynamic loop caches, have 
been proposed to instead reduce software power. We propose several new tiny instruction 
cache designs, including preloaded loop caches, and one-level and two-level hybrid 
dynamic/preloaded loop caches. We evaluate the existing and proposed designs on 
embedded system software benchmarks from both the Powerstone and ... 

Keywords: Loop cache, architecture tuning, embedded systems, filter cache, fixed
program, instruction cache, low energy, low power 
chip hardware

Ann Gordon-Ross, Frank Vahid 

October 2003 Proceedings of the 2003 international conference on Compilers, 
architecture and synthesis for embedded systems 

Full text available: pdf(278.07 KB) Additional Information: full citation, abstract, references, index terms

Dynamic software optimization methods are becoming increasingly popular for improving 
software performance and power. The first step in dynamic optimization consists of 
detecting frequently executed code, or "critical regions." Previous critical region detectors 
have been targeted to desktop processors. We introduce a critical region detector targeted 
to embedded processors, with the unique features of being very size and power efficient, 
and being completely non-intrusive to the software's exec ... 

Keywords: dynamic optimization, frequent loop detection, frequent value profiling, 
hardware profiling, hot spot detection, on-chip profiling, runtime profiling 
Timothy Sherwood, George Varghese, Brad Calder

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th
annual international symposium on Computer architecture, Volume 31 Issue 2

Full text available: pdf(213.65 KB) Additional Information: full citation, abstract, references, citings

Designing ASICs for each new generation of backbone routers is a time intensive and 
fiscally draining process. In this paper we focus on the design of a programmable 
architecture for backbone routers, based on the manipulation of wide irregular memory 
words, that can provide a feasible design alternative to custom ASICs. We propose a 
pipelined memory design that emphasizes worst-case throughput over latency, and co- 
explore architectural tradeoffs with the design of several important network algo ... 
In this paper we present the internal representation and optimizations used by the CASH
compiler for improving the memory parallelism of pointer-based programs. CASH uses an
SSA-based representation for memory, which compactly summarizes both control-flow and
dependence information. In CASH, memory optimization is a four-step process: (1) first an
initial, relatively coarse, representation of memory dependences is built; (2) next, 
unnecessary memory dependences are removed using dependence tests; ... 

12 Energy-aware design ...

Luca Benini, Alberto Macii, Massimo Poncino 

February 2003 ACM Transactions on Embedded Computing Systems (TECS), volume 2 issue 



Full text available: pdf Additional Information: full citation, abstract, references, index terms



Embedded systems are often designed under stringent energy consumption budgets, to 
limit heat generation and battery size. Since memory systems consume a significant 
amount of energy to store and to forward data, it is then imperative to balance power 
consumption and performance in memory system design. Contemporary system design 
focuses on the trade-off between performance and energy consumption in processing and 
storage units, as well as in their interconnections. Although memory design is as ... 

Keywords: Embedded systems, embedded memories, integration, memories, nonvolatile, 
system-on-a-chip, volatile 
stencil codes
Claudia Leopold 

March 2002 Proceedings of the 2002 ACM symposium on Applied computing 



Iterative solvers such as the Jacobi and Gauss-Seidel relaxation methods are important, but 
time-consuming building blocks of many scientific and engineering applications. The 
performance problems are largely due to cache misses, and can be reduced by tiling the 
codes. Whereas previous research has shown the usefulness of tiling by experimentally 
comparing the run times of tiled and original codes, it did not tackle the question, as to 
whether further improvements are possible. In this paper, we ... 

Keywords: data locality, lower bounds, relaxation methods, tiling 
The ideal goal of not having the user dealing with concurrency aspects has proven hard to
achieve in the context of system (compiler, run-time) supported automatic parallelization 
for general purpose languages and applications. More focused approaches, of automatic 
parallelization for numerical applications with a regular structure have been successful. Still, 
they cannot fully handle irregular applications (e.g., the solution of Partial Differential
Equations (PDEs) for general geometries). This p ...

Keywords: concurrency, distributed components, generic programming, scientific 
applications 
Shobha Singh, Shamsi Azmi, Nutan Agrawal, Penaka Phani, Ansuman Rout
January 2002 Proceedings of the 2002 conference on Asia South Pacific design 
automation/VLSI Design 

Full text available: pdf(166.65 KB) Additional Information: full citation, abstract

Critical issues in designing a high speed, low power static RAM in deep submicron 
technologies are described along with the design techniques used to overcome them. With 
appropriate circuit partitioning, transistor sizing, choice of a suitable Sense Amplifier, a good
resetting technique and judicious use of dual Vth transistors we have achieved a high speed
memory without dissipating too much power. The Introduction gives the specifications of 
the memory that was our design target. In Section II, w ... 

16 Morphable cache ar ...

I. Kadayif, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, J. Ramanujam 
August 2001 ACM SIGPLAN Notices, Volume 36 Issue 8

Full text available: pdf(302.99 KB) Additional Information: full citation, abstract, references, citings, index terms

Computer architects have tried to mitigate the consequences of high memory latencies 
using a variety of techniques. An example of these techniques is multi-level caches to
counteract the latency that results from having a memory that is slower than the processor. 
Recent research has demonstrated that compiler optimizations that modify data layouts and 
restructure computation can be successful in improving memory system performance. 
However, in many cases, working with a fixed cache configuration ... 

17 Compiler-based I/O prefetching for out-of-core applications 
Angela Demke Brown, Todd C. Mowry, Orran Krieger 

May 2001 ACM Transactions on Computer Systems (TOCS), volume 19 issue 2 

Full text available: pdf(499.03 KB) Additional Information: full citation, abstract, references, citings, index terms, review

Current operating systems offer poor performance when a numeric application's working set 
does not fit in main memory. As a result, programmers who wish to solve "out-of-core" 
problems efficiently are typically faced with the onerous task of rewriting an application to 
use explicit I/O operations (e.g., read/write). In this paper, we propose and evaluate a fully 
automatic technique which liberates the programmer from this task, provides high 
performance, and requires only minima ... 

Keywords: compiler optimization, prefetching, virtual memory 
Lode Nachtergaele, Vivek Tiwari, Nikil Dutt

November 2000 Proceedings of the 2000 IEEE/ACM international conference on 
Computer-aided design 

Full text available: pdf(34.99 KB) Additional Information: full citation, abstract, references, citings

Current microprocessor architectures become more and more dominated by the data access 
bottlenecks in the cache, system bus and main memory subsystems. These also have a 
major influence on the system (board-level) power consumption. In practice this means 
lower energy consumption for a given throughput requirement. In the booming domain of
(largely embedded) cost-sensitive communication and multi-media applications, more and 
more implementations make use of microprocessor based platforms for flex ... 



19 Landing ... path

Kevin B. Theobald, Gagan Agrawal, Rishi Kumar, Gerd Heber, Guang R. Gao, Paul Stodghill, 
Keshav Pingali 

November 2000 Proceedings of the 2000 ACM/IEEE conference on Supercomputing 
(CDROM) 

Full text available: pdf(150.46 KB) Additional Information: full citation, abstract, references, index terms

We report on our work in developing a fine-grained multithreaded solution for the 
communication-intensive Conjugate Gradient (CG) problem. In our recent work, we have 
developed a simple, yet very efficient, solution to executing matrix-vector multiply on a 
multithreaded system. This paper presents an effective mechanism for the reduction- 
broadcast phase, which is implemented and integrated with the sparse MVM resulting in a 
scalable implementation of the complete CG application. 

20 A recursive algori ...
Luca Benini, Alberto Macii, Massimo Poncino 

August 2000 Proceedings of the 2000 international symposium on Low power 
electronics and design 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms

Memory-processor integration offers new opportunities for reducing the energy of a system. 
In the case of embedded systems, one solution consists of mapping the most frequently 
accessed addresses onto the on-chip SRAM to guarantee power and performance efficiency. 
This option is especially effective when memory access patterns can be profiled and studied 
at design time (as in typical real-time embedded systems). In this work, we propose an 
algorithm for the automatic partitioning ... 
21 A compiler method for the parallel execution of irregular reductions in scalable shared
memory ...

E. Gutierrez, O. Plata, E. L. Zapata 

May 2000 Proceedings of the 14th international conference on Supercomputing 

Full text available: pdf(898.76 KB) Additional Information: full citation, abstract, references, citings, index terms



This paper presents a new parallelization method for reductions of arrays with subscripted 
subscripts on scalable shared memory multiprocessors. The mapping of computations is 
based on grouping reduction loop iterations into sets that are further assigned to the 
cooperating threads of computation. Iterations belonging to the same set are chosen in
such a way that they update different entries in the reduction array. That is, the loop distribution
implies a conflict-free write distribution of the ... 

22 Cache miss equ ... behavior

Somnath Ghosh, Margaret Martonosi, Sharad Malik 

July 1999 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 21 Issue 4

Full text available: pdf(548.16 KB) Additional Information: full citation, abstract, references, citings, index terms, review



With the ever-widening performance gap between processors and main memory, cache 
memory, which is used to bridge this gap, is becoming more and more significant. Caches 
work well for programs that exhibit sufficient locality. Other programs, however, have 
reference patterns that fail to exploit the cache, thereby suffering heavily from high 
memory latency. In order to get high cache efficiency and achieve good program 
performance, efficient memory accessing behavior is necessary. In fact, f ... 



Keywords: cache memories, compilation, optimization, program transformation 
We present a system that allows OpenMP programs to execute on a network of
workstations with a variable number of nodes. The ability to adapt to a variable number of 
nodes allows a program to take advantage of additional nodes that become available after it 
starts execution, or to gracefully scale down when the number of available nodes is 
reduced. We demonstrate that the cost of adaptation is modest; the system allows a 
program to adapt at a moderate rate without much performance loss. Two ideas ...

24 OpenMP on networks of workstations
Honghui Lu, Y. Charlie Hu, Willy Zwaenepoel 

November 1998 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
(CDROM) 

Full text available: pdf(202.91 KB) Additional Information: full citation, abstract, references, citings

We describe an implementation of a sizable subset of OpenMP on networks of workstations 
(NOWs). By extending the availability of OpenMP to NOWs, we overcome one of its primary 
drawbacks compared to MPI, namely lack of portability to environments other than 
hardware shared memory machines. In order to support OpenMP execution on NOWs, our 
compiler targets a software distributed shared memory system (DSM) which provides multi- 
threaded execution and memory consistency. This paper presents two contri ...

25 Automatic ...
Ken Kennedy, Ulrich Kremer

July 1998 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 20 Issue 4

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms, review

The goal of languages like Fortran D or High Performance Fortran (HPF) is to provide a 
simple yet efficient machine-independent parallel programming model. After the algorithm 
selection, the data layout choice is the key intellectual challenge in writing an efficient 
program in such languages. The performance of a data layout depends on the target 
compilation system, the target machine, the problem size, and the number of available 
processors. This makes the choice of a good layout extremel ... 

Keywords: high performance Fortran 



26 Tolerating latency in multiprocessors through compiler-inserted prefetching
Todd C. Mowry 

February 1998 ACM Transactions on Computer Systems (TOCS), Volume 16 Issue 1

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms, review



The large latency of memory accesses in large-scale shared-memory multiprocessors is a 
key obstacle to achieving high processor utilization. Software-controlled prefetching is a 
technique for tolerating memory latency by explicitly executing instructions to move data 
close to the processor before the data are actually needed. To minimize the burden on the 
programmer, compiler support is needed to automatically insert prefetch instructions into 
the code. A key challenge when ... 

Keywords: compiler optimization, prefetching 
Using dataflow analysis techniques to reduce ownership overhead in cache coherence
Jonas Skeppstedt, Per Stenstrom

November 1996 ACM Transactions on Programming Languages and Systems (TOPLAS),
Full text available: pdf(284.08 KB)

In this article, we explore the potential of classical dataflow analysis techniques in removing 
overhead in write-invalidate cache coherence protocols for shared-memory multiprocessors. 
We construct the compiler algorithms with varying degree of sophistication that detect loads 
followed by stores to the same address. Such loads are marked and constitute a hint to the 
cache to obtain an exclusive copy of the block so that the subsequent store does not 
introduce access penalties. The simplest ... 

Keywords: cache coherence, dataflow analysis, performance evaluation 
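
The detection of loads followed by stores to the same address can be sketched for the simplest case, a single basic block. The instruction encoding (a list of `(op, address)` pairs) and the function name `mark_exclusive_loads` are assumptions made for illustration; the article's algorithms use dataflow analysis across blocks, which this sketch does not attempt.

```python
def mark_exclusive_loads(block):
    """Return indices of loads that a later store in the same block overwrites.

    `block` is a list of (op, address) pairs with op in {"load", "store"}.
    A marked load would carry a hint asking the cache for an exclusive copy,
    so that the subsequent store needs no separate ownership request.
    """
    marked = []
    pending = {}  # address -> index of the earliest load not yet matched to a store
    for index, (op, address) in enumerate(block):
        if op == "load":
            pending.setdefault(address, index)
        elif op == "store" and address in pending:
            marked.append(pending.pop(address))
    return sorted(marked)
```

In the block `load a; load b; store a; store c; load c`, only the first load is marked: `b` is never stored to, and the load of `c` comes after its store.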

28 Modeling ASIC memories in VHDL
E. Balaji, P. Krishnamurthy 

September 1996 Proceedings of the conference on European design automation 

Full text available: pdf(85.09 KB) Additional Information: full citation, references, index terms



29 Compiler-directed page coloring for multiprocessors 
Edouard Bugnion, Jennifer M. Anderson, Todd C. Mowry, Mendel Rosenblum, Monica S. Lam 
September 1996 Proceedings of the seventh international conference on Architectural
support for programming languages and operating systems, Volume 31, 30 Issue 9, 5

Full text available: pdf(1.37 MB) Additional Information: full citation, abstract, references, citings, index terms
This paper presents a new technique, compiler-directed page coloring, that eliminates
conflict misses in multiprocessor applications. It enables applications to make better use of 
the increased aggregate cache size available in a multiprocessor. This technique uses the 
compiler's knowledge of the access patterns of the parallelized applications to direct the 
operating system's virtual memory page mapping strategy. We demonstrate that this 
technique can lead to significant performance impr ... 
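
As background for the page-coloring idea, the following sketch shows how a page's color is conventionally derived for a physically indexed cache, and a naive round-robin assignment of colors to pages that are expected to be used together. The helper names and the round-robin policy are illustrative assumptions; they do not reproduce the compiler analysis or the operating-system interface of the paper.

```python
def num_colors(cache_size, page_size, associativity):
    """Number of distinct page colors in a physically indexed cache."""
    return cache_size // (page_size * associativity)


def page_color(phys_addr, cache_size, page_size, associativity):
    """Color of the physical page containing `phys_addr`.

    Pages with the same color map to the same cache sets and can conflict.
    """
    colors = num_colors(cache_size, page_size, associativity)
    return (phys_addr // page_size) % colors


def assign_colors(virtual_pages, colors):
    """Hypothetical hint: spread pages used together across the colors."""
    return {page: i % colors for i, page in enumerate(virtual_pages)}
```

For a 1 MiB direct-mapped cache with 4 KiB pages there are 256 colors, so physical pages 0 and 256 share a color and compete for the same cache sets.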

30 Improving data ...

Kathryn S. McKinley, Steve Carr, Chau-Wen Tseng 

July 1996 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 18 Issue 4

Full text available: pdf(411.40 KB) Additional Information: full citation, abstract, references, citings, index terms, review

In the past decade, processor speed has become significantly faster than memory speed. 
Small, fast cache memories are designed to overcome this discrepancy, but they are only 
effective when programs exhibit data locality. In this article, we present compiler
optimizations to improve data locality based on a simple yet accurate cost model. The 
model computes both temporal and spatial reuse of cache lines to find desirable loop 
organizati ... 

Keywords: Cache, compiler optimization, data locality, loop distribution, loop fusion, loop 
permutation, loop reversal, loop transformations, microprocessors, simulation 
32 Compiler support for ...
Frank Mueller 

November 1995 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1995
workshop on Languages, compilers, & tools for real-time systems,
Volume 30 Issue 11

Full text available: pdf(886.21 KB) Additional Information: full citation, abstract, references, citings, index terms
Cache memories have become an essential part of modern processors to bridge the
increasing gap between fast processors and slower main memory. Until recently, cache 
memories were thought to impose unpredictable execution time behavior for hard real-time 
systems. But recent results show that the speedup of caches can be exploited without a 
significant sacrifice of predictability. These results were obtained under the assumption that 
real-time tasks be scheduled non-preemptively. This paper ...

33 Compiler optimizati ...
Chau-Wen Tseng 

August 1995 ACM SIGPLAN Notices, Proceedings of the fifth ACM SIGPLAN symposium
on Principles and practice of parallel programming, Volume 30 Issue 8

Full text available: pdf(1.38 MB) Additional Information: full citation, abstract, references, citings, index terms

This paper presents novel compiler optimizations for reducing synchronization overhead in 
compiler-parallelized scientific codes. A hybrid programming model is employed to combine 
the flexibility of the fork-join model with the precision and power of the single-program, 
multiple data (SPMD) model. By exploiting compile-time computation partitions, 
communication analysis can eliminate barrier synchronization or replace it with less 
expensive forms of synchronization. We show computation part ... 

34 Data and computati ...

Jennifer M. Anderson, Saman P. Amarasinghe, Monica S. Lam 

August 1995 ACM SIGPLAN Notices, Proceedings of the fifth ACM SIGPLAN symposium
on Principles and practice of parallel programming, Volume 30 Issue 8

Full text available: pdf(1.44 MB) Additional Information: full citation, abstract, references, citings, index terms
Effective memory hierarchy utilization is critical to the performance of modern
multiprocessor architectures. We have developed the first compiler system that fully 
automatically parallelizes sequential programs and changes the original array layouts to 
improve memory system performance. Our optimization algorithm consists of two steps. 
The first step chooses the parallelization and computation assignment such that 
synchronization and data sharing are minimized. The second step then restruc ... 

35 Unified compilation techniques for shared and distributed address space machines 
Chau-Wen Tseng, Jennifer M. Anderson, Saman P. Amarasinghe, Monica S. Lam 
July 1995 Proceedings of the 9th international conference on Supercomputing 
Distributed shared memory (DSM) systems provide an illusion of shared memory on
distributed memory systems such as workstation networks and some parallel computers 
such as the Cray T3D and Convex SPP-1. This illusion is provided either by enhancements 
to hardware, software, or a combination thereof. On these systems, users can write 
programs using a shared memory style of programming instead of message passing which 
is tedious and error prone. Our experience with one such system, TreadMarks, has ... 

37 Simple compiler algorithms to reduce ownership overhead in cache coherence 
protocols 

Jonas Skeppstedt, Per Stenstrom 

November 1994 Proceedings of the sixth international conference on Architectural
support for programming languages and operating systems, Volume 29, 28 Issue 11, 5

Full text available: pdf(1.47 MB) Additional Information: full citation, abstract, references, citings, index terms
We study in this paper the design and efficiency of compiler algorithms that remove
ownership overhead in shared-memory multiprocessors with write-invalidate protocols. 
These algorithms detect loads followed by stores to the same address. Such loads are 
marked and constitute a hint to the cache to obtain an exclusive copy of the block. We 
consider three algorithms where the first one focuses on load-store sequences within each 
basic block of code and the other two analyse the existence of I ... 
November 1994 Proceedings of the sixth international conference on Architectural
support for programming languages and operating systems, Volume 29, 28 Issue 11, 5
In the past decade, processor speed has become significantly faster than memory speed.
Small, fast cache memories are designed to overcome this discrepancy, but they are only 
effective when programs exhibit data locality. In this paper, we present compiler 
optimizations to improve data locality based on a simple yet accurate cost model. The 
model computes both temporal and spatial reuse of cache lines to find desirable loop 
organizations. T ... 
July 1994 Proceedings of the 8th international conference on Supercomputing

Full text available: pdf(1.05 MB) Additional Information: full citation, abstract, references, citings, index terms

There are many applications in CFD and structural analysis that can be more accurately 
modeled using unstructured grids. Parallelization of implicit methods for unstructured grids 
is a difficult and important problem. This paper deals with coloring techniques to overlap 
computation and communication during the solution of implicit methods on message 
passing distributed memory multicomputers. An evaluation of coloring techniques for 
partitioned unstructured grids is first presented. Results ... 
lack of a shared-memory programming performance model that can inform programmers of
the cost of operations (so they can avoid expensive ones) and can tell hardware designers 
which cases are common (so they can build simple hardware to optimize them). 
Cooperative shared memory, our approach to shared-memory design, addresses this 
problem. Our initial implementation of cooperative shared memory u ... 

Keywords: cache coherence, directory protocols, memory systems, programming model, 
shared-memory multiprocessors 
One of the most challenging steps in developing a parallel program for a distributed
memory machine is determining how data should be distributed across processors. Most of 
the compilers being developed to make it easier to program such machines still provide no 
assistance to the programmer in this difficult and machine-dependent task. We have 
developed PARADIGM, a compiler that makes data partitioning decisions for Fortran 77 
procedures. A significant feature of the design of PARADIGM is t ... 
A hardware prefetching mechanism named Speculative Prefetching is proposed. This
scheme detects vector accesses issued by a load/store instruction and prefetches the 
corresponding data. The scheme requires no software add-on, and in some cases it is more 
powerful than software techniques for identifying regular accesses. The tradeoffs related to 
its hardware implementation are extensively discussed in order to finely tune the 
mechanism. Experiments show that average memory ... 
Mapping data to parallel computers aims at minimizing the execution time of the associated
application. However, it can take an unacceptable amount of time in comparison with the 
execution time of the application if the size of the problem is large. In this paper, first we 
motivate the case for graph contraction as a means for reducing the problem size. We 
restrict our discussion to applications where the problem domain can be described using a 
graph (e.g., computational fluid dynamics appl ... 
We believe the absence of massively-parallel, shared-memory machines follows from the
lack of a shared-memory programming performance model that can inform programmers of 
the cost of operations (so they can avoid expensive ones) and can tell hardware designers 
which cases are common (so they can build simple hardware to optimize them). 
Cooperative shared memory, our approach to shared-memory design, addresses this 
problem. Our initial implementation of cooperativ ... 
Regular meshes are frequently used for modeling physical phenomena on both serial and
parallel computers. One advantage of regular meshes is that efficient discretization 
schemes can be implemented in a straightforward manner. However, geometrically- 
complex objects, such as aircraft, cannot be easily described using a single regular mesh. 
Multiple interacting regular meshes are frequently used to describe complex geometries. 
Each mesh models a subregion of the physical domain. The meshes, o ... 
We describe a parallel programming tool for scheduling static task graphs and generating
the appropriate target code for message passing MIMD architectures. The computational
complexity of the system is almost linear in the size of the task graph, and preliminary
experiments show performance comparable to the "best" hand-written programs.
Memory referencing behavior is analyzed via the study of traces for the purpose of
developing new local memory structures and management techniques. A novel trace 
processing technique called flattening reduces the dependence of the results on the 
underlying compiler and architecture on which the trace was generated, and partitions each 
memory location into its constituent values. The referencing patterns of each value in the
resulting trace are described via ...

57 Associative and Parallel Processors
Kenneth J. Thurber, Leon D. Wald 

December 1975 ACM Computing Surveys (CSUR), volume 7 issue 4 

Full text available: pdf(2.62 MB) Additional Information: full citation, references, citings, index terms




The ACM Portal is published by the Association for Computing Machinery. Copyright © 2005 ACM, Inc. 




